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Genome-wide association studies have been successful in identifying loci contributing effects to a range of complex human 
traits. The majority of reproducible associations within these loci are with common variants, each of modest effect, which 
together explain only a small proportion of heritability. It has been suggested that much of the unexplained genetic component 
of complex traits can thus be attributed to rare variation. However, genome-wide association study genotyping chips have 
been designed primarily to capture common variation, and thus are underpowered to detect the effects of rare variants. 
Nevertheless, we demonstrate here, by simulation, that imputation from an existing scaffold of genome-wide genotype data 
up to high-density reference panels has the potential to identify rare variant associations with complex traits, without the 
need for costly re-sequencing experiments. By application of this approach to genome-wide association studies of seven 
common complex diseases, imputed up to publicly available reference panels, we identify genome-wide significant evidence 
of rare variant association in PRDM10 with coronary artery disease and multiple genes in the major histocompatibility 
complex (MHC) with type 1 diabetes. The results of our analyses highlight that genome-wide association studies have the 
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INTRODUCTION 

There has been much recent debate as to the role of rare ge- 
netic variation, defined here to have a minor allele frequency 
(MAF) of less than 1%, in explaining the 'missing heritabil- 
ity' of complex traits [Dickson et al, 2010; Frazer et al., 2009; 
Yang et al., 2010]. Rare variants are likely to have originated 
from founder effects in the last 20 generations, and thus are 
more likely to be population specific [Bodmer and Bonilla, 
2008]. They are also likely to have larger effects on complex 
traits than common variants, consistent with the expecta- 
tion that they will have been subject to purifying selection 
after recent expansion of the human population [Pritchard, 
2001]. However, these effects are unlikely to be sufficiently 
large to be detected through association testing with indi- 
vidual rare variants. Statistical methods have thus focussed 
on the aggregation of the effects of all rare variants within 
the same exon, gene or pathway, potentially weighting ac- 
cording to annotation or MAF [Han and Pan, 2010; Hoffman 
et al., 2010; Li et al, 2010; Li and Leal, 2008; Madsen and 
Browning, 2009; Morgenthaler and Thilly, 2007; Morris and 



Zeggini, 2010; Neale et al., 2011; Price et al., 2010; Wu et al, 
2011; Zawistowski et al, 2010]. Using these methods, mul- 
tiple rare variants have been demonstrated to be associated 
with a variety of complex traits including low- and high- 
density lipoprotein [Cohen et al, 2006; Romeo et al., 2007], 
body mass index [Ahituv et al., 2007] and blood pressure [Ji 
etal.,2008]. 

The most comprehensive approach to characterising the 
contribution of rare variants to the genetic component 
of complex traits is through large-scale, next-generation 
re-sequencing studies [Metzker, 2010]. Despite improve- 
ments in the throughput and efficiency of these technolo- 
gies, rare variant re-sequencing efforts on the scale of the 
whole genome still represent an infeasible financial un- 
dertaking for most research groups. Consequently, most 
rare variant studies have focussed on candidate genes, 
or more recently, the exome [Ng et al., 2010]. However, 
high-density reference panels obtained from whole-genome 
re-sequencing data are being released through the 1000 
Genomes Project, providing a comprehensive catalogue 
of variation with MAF as low as 0.5%, as well as many 
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rarer variants, across a wide range of populations from 
different ethnic groups [The 1000 Genomes Project Con- 
sortium, 2010]. Such reference panels could be utilised to 
select rare variants for large-scale genotyping with custom 
designed arrays, potentially with priority given to variants 
with likely functional consequences in an effort to reduce 
costs, such as the Illumina Infinium HumanExome Bead- 
Chip. Conversely, genome-wide association study (GWAS) 
genotyping products have been designed primarily to cap- 
ture common genetic variation, and thus offer poor cover- 
age of rare variants [Barrett and Cardon, 2006]. However, 
if samples have already been assayed by means of such a 
GWAS chip, imputation techniques [Marchini and Howie, 
2010] can make use of this existing scaffold to predict geno- 
types at variants present on the higher density reference 
panel, incurring no additional cost, other than computation, 
although this is far from trivial. 

Here, we formulate methodology for the detection of 
complex trait association with accumulations of minor alle- 
les within genes, or some other functional unit, using data 
from directly typed and/or imputed rare variants. We re- 
port the results of simulations to investigate the power of 
alternative design strategies for assaying and characterising 
rare genetic variation to detect association with a quantita- 
tive trait in a 50 kb gene. Our study considers a simple model 
of rare variant association with the trait, and assesses the im- 
pact on power of the number of individuals present in the 
reference panel. We also present results of an application 
of our methodology to rare variant association analysis of 
seven common complex diseases undertaken by the Well- 
come Trust Case Control Consortium (WTCCC) [The Well- 
come Trust Case Control Consortium, 2007], using GWAS 
data imputed up to the Phase 1 1000 Genomes Project refer- 
ence panel (June 2011 interim release) [The 1000 Genomes 
Project Consortium, 2010]. 

METHODS 

MODEL FORMULATION 

We test for association of a complex trait with accumula- 
tions of minor alleles at N rare variants, here defined to have 
MAF less than 1%, within the same exon, gene or some other 
functional unit, in a sample of unrelated individuals. Let w, 
denote the number of rare variants at which the ith individ- 
ual has been successfully genotyped in the functional unit. 
Furthermore, let G,y denote the genotype of this individual 
at the ;th rare variant, coded as 0 for the common homozy- 
gote, and 1 otherwise. We can model the phenotype, y„ of 
this individual in a generalised linear regression framework 
as a function of the proportion of rare variants at which they 
carry at least one minor allele [Morris and Zeggini, 2010], 
given by 

N 

Pi =J2G ij /n i . 

;=i 

Specifically, E(y,) = §" _1 (ct + Pp,), where g is the link func- 
tion and p is the expected increase in the phenotype for an 
individual carrying a full complement of minor alleles at 
rare variants in the functional unit compared to an individ- 
ual carry none. It follows that P/N is the increase in the 
expected phenotype of an individual for each rare variant 
at which they carry a minor allele. The likelihood contribu- 



tion of the ith individual, f(y, \a, p, p,), is weighted by n, 
to allow for differential call rates between samples. We can 
thus construct a likelihood ratio test by comparing the max- 
imised weighted likelihoods of two models via analysis of 
deviance: (i) the null model where p = 0; and (ii) the alterna- 
tive model for which p is unconstrained. The resulting test 
statistic has an approximate x 2 distribution with one degree 
of freedom. The flexibility of the generalised linear regres- 
sion framework allows for generalisation of this approach 
to take account of non-genetic risk factors as covariates. 

It is straightforward to accommodate imputed rare vari- 
ants in the functional unit within the generalised linear 
regression framework by considering the posterior distri- 
bution of genotype calls. Specifically, we replace missing 
genotypes (at typed variants, or at untyped variants in the 
imputation reference panel) by their expectation, E(Gy) = 
1 - hij, where hq is the posterior probability of a common 
homozygote call in the ith individual at the jih rare vari- 
ant. The posterior probabilities, hij, of imputed genotype 
calls are easily recovered from standard imputation soft- 
ware such as IMPUTEv2 [Howie et al., 2009] and BEAGLE 
[Browning and Browning, 2009]. 

The methodology described above has been implemented 
in the GRANVIL software, and is freely available for down- 
load (http://www.well.ox.ac.uk/GRANVIL). The open- 
source software has been designed to efficiently handle 
analysis of directly genotyped and imputed rare variants 
on a genome-wide scale, and can accommodate both quan- 
titative traits and binary phenotypes in a generalised linear 
modelling framework, as described above. The user sup- 
plies a file containing the boundaries of each functional 
unit to be analysed, together with SNPTEST format sam- 
ple and genotype files, and specifies the rare variant MAF 
threshold. GRANVIL is distributed with scripts to generate 
graphical summaries of GWAS rare variant analysis results, 
including quantile-quantile and Manhattan plots. 

SIMULATION STUDY 

We have performed simulations to evaluate the relative 
performance of different design strategies to identify quan- 
titative trait association with rare variants in a 50 kb gene. 
We have considered an analysis cohort of 2,000 individuals, 
and a reference panel ascertained from the same population. 
We have compared the power of GRANVIL in the following 
scenarios: (i) direct re-sequencing of the analysis cohort; (ii) 
direct genotyping of the analysis cohort for all rare variants 
present in the reference panel; (iii) direct genotyping of the 
analysis cohort for variants on a GWAS chip with the same 
characteristics as the Illumina Human660W-Quad BeadAr- 
ray in terms of density and MAF; and (iv) direct genotyping 
of the analysis cohort for variants on the GWAS chip, sup- 
plemented with imputation of untyped variants present in 
the reference panel using IMPUTEv2 [Howie et al., 2009]. 

We have considered a simple underlying model for the 
association of the trait with multiple rare causal variants 
within the same gene. We assumed that the expected trait 
value of each individual is increased by the presence of 
a minor allele at any causal variant. The trait association 
model was then parameterised in terms of: (i) the maxi- 
mum MAF, 8, of any individual causal variant; (ii) the total 
MAF, Q, of all causal variants in the region; and (iii) their 
joint contribution to the overall trait variance, expressed 
as 100X.%. Here, we considered reference panels of R — 
120, R = 500 and R = 4,000 individuals. The number of 
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individuals was chosen to represent a range of reference 
panels incorporating those available from the 1000 Genomes 
Project (pilot release), through to those we might expect 
from future large-scale deep re-sequencing efforts, such as 
the UK10K initiative (http://www.uklOk.org/). For each 
model, we generated 500 replicates of data as follows. 

(1) Generate an ancestral recombination graph [Griffiths 
and Marjoram, 2007] for a population of 40,000 haplo- 
types from a realisation of the coalescent process with 
recombination, obtained using the MS software [Hud- 
son, 2002]. We assumed a mutation rate of 10~ 8 per base 
(in each generation) and a uniform recombination rate 
of 1 cM per Mb, for an effective population size of 10,000 
individuals. In total, we simulated a region of 1,050 kb, 
including a 50 kb gene and 500 kb up- and down-stream 
to allow for an imputation buffer to improve accuracy 
by avoiding edge effects and taking advantage of the 
expected long-range linkage disequilibrium (LD) with 
rare variants [The International HapMap Consortium, 
2007]. 

(2) Calculate the MAF at each variant across the 50 kb gene 
in the population of 40,000 chromosomes, denoted by 
qj for the ;th variant. Select a variant as causal from 
amongst those with MAF qj < 8, at random. Continue 
selecting causal variants in this way, without replace- 
ment, until the total MAF of all causal variants is Q. 

(3) Select a random sample of 4,000 chromosomes from the 
population, paired together to form the analysis cohort. 
Determine the number of minor alleles carried by the fth 
individual across all causal variants in the 50 kb gene, 
denoted by m,-. The phenotype, y,, of the rth individual 
is then simulated from a Gaussian N(|x,,a) distribution, 
where <r is determined by the spectrum of causal vari- 
ants and their joint contribution, X, to the overall trait 
variance, and jul, = 1 if m, > 0, and 0 otherwise. Full 
details of the derivation of the residual trait variance, 
cr 2 , are provided in the Appendix. 

(4) Select a random sample of 2R chromosomes from the 
remainder of the population to be haplotypes in the 
reference panel. Assuming no genotyping or phasing 
errors in the reference panel, record the haplotype of 
each of these chromosomes across all variants in the 
1,050 kb region. 

(5) Begin by considering the strategy in which the analysis 
cohort has been directly re-sequenced in the 50 kb gene. 
Assuming no sequencing errors, record the genotype of 
each individual at each variant with MAF < 1% in the 
analysis cohort. Test for association of the quantitative 
trait with an accumulation of minor alleles at these vari- 
ants using GRANVIL, and record the P-value, denoted 
by Pseq- 

(6) Then consider the scenario in which the analysis cohort 
has been directly genotyped for all variants in the 50 
kb gene which are present in the reference panel. As- 
suming no genotyping errors, record the genotype of 
each individual at each variant present in the reference 
panel with MAF < 1% in the analysis cohort. Test for 
association of the quantitative trait with an accumula- 
tion of minor alleles at these variants using GRANVIL, 
and record the P-value, denoted by Pgen- 

(7) Next consider the scenario in which the analysis co- 
hort has been directly genotyped only for variants on 
a GWAS chip. Select a random 1,050 kb region of the 
genome, and determine the number of variants, «gwas, 



present on the Illumina Human660W-Quad BeadArray 
in that interval. Select n GWAS variants at random and 
without replacement, with ascertainment probability 
§j — 4*7,(1 — q j)/T, as present on the chip, where T — 
qj(l — qj). This probability density incorporates the 
strong bias towards common variants on GWAS chips, 
generating an approximately uniform distribution of 
MAF [Anderson et al., 2008]. Assuming no genotyping 
errors, record the genotype of each individual at each 
variant within the 50 kb gene with MAF < 1% in the 
analysis cohort. Test for association of the quantitative 
trait with an accumulation of minor alleles at these vari- 
ants using GRANVIL, and record the P-value, denoted 
by pcms. 

(8) Finally, consider the scenario in which the analysis co- 
hort has been directly genotyped only for variants on 
the GWAS chip, but are subsequently supplemented 
with imputation of untyped variants present in the ref- 
erence panel. Assuming no genotyping errors, record 
the genotype of each individual at each variant across 
the 1,050 kb region, irrespective of MAF. Impute the 
genotype of each individual in the analysis cohort at 
each variant present in the reference panel in the 50 kb 
gene using IMPUTEv2 [Howie et al., 2009], assuming 
an effective population size of 10,000 individuals and a 
buffer region of 500 kb. Test for association of the quan- 
titative trait with an accumulation of minor alleles at 
directly genotyped and imputed variants within the 50 
kb gene with MAF < 1% in the analysis cohort and 'info 
score' greater than 0.4 using GRANVIL, and record the 
P-value, denoted by piMP- 

Over all simulated data sets, we calculated the power 
of each design strategy at a nominal 5% significance level 
as the proportion of replicates for which the correspond- 
ing P-value is less than 0.05. We also calculated the mean 
numbers of rare variants in the 50 kb gene: (i) in the popu- 
lation of 40,000 chromosomes; (ii) identified through direct 
re-sequencing of the analysis cohort; (iii) present on the ref- 
erence panel and identified through direct genotyping of the 
analysis cohort; (iv) present on the GWAS chip; (v) present 
on the reference panel and well imputed (info score greater 
than 0.4) in the analysis cohort. 

One potential limitation of our simulation study is the 
assumption of no sequencing errors, perfect phasing of the 
reference panel, and no missing or miscalled genotypes. 
These errors might be expected to be most detrimental to our 
proposed imputation strategy since this process requires 
an accurate GWAS scaffold and phased reference panel. To 
address this issue, we have also performed simulations to 
assess the robustness of our results to re-sequencing errors 
and missing or miscalled genotypes. We introduced a sim- 
ple model of errors [Hao et al., 2004] by randomly swapping 
the base call (ancestral or mutant) of each individual at each 
variant with probability e. Here, we considered e = 0% (no 
errors), £ = 0.1% (error rate of ~0.2% in genotype calls) 
and e = 0.2% (error rate of ~0.4% in genotype calls). We 
evaluated the impact of call rate by removing genotypes, 
at random, with probability k, where here we considered 
k = 0% (no missing genotype data) and k = 1%. Errors 
were introduced first into the reference panel (Step 4), and 
then into the analysis cohort (Steps 5, 6, 7 and 8). Missing 
genotype data were then introduced at random (Steps 6, 
7 and 8). 
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APPLICATION TO IMPUTED RARE VARIANT 
GWAS OF SEVEN COMMON COMPLEX 
DISEASES 

We considered 14,000 cases of seven common complex 
diseases (bipolar disorder, coronary artery disease, Crohn's 
disease, hypertension, rheumatoid arthritis, type 1 diabetes 
and type 2 diabetes) and 3,000 shared controls from the 
WTCCC [The Wellcome Trust Case Control Consortium, 
2007]. Samples were ascertained from the United King- 
dom and genotyped using the Affymetrix GeneChip 500K 
Mapping Array Set, which incorporates 500,568 single nu- 
cleotide polymorphisms (SNPs) genome-wide. We utilised 
the same quality control (QC) filters employed by the 
WTCCC to exclude samples and SNPs from the analysis, 
full details of which are presented in the description of the 
experiment [The Wellcome Trust Case Control Consortium, 
2007]. In brief, samples were excluded on the basis of low 
call rate, outlying genome-wide heterozygosity, discrepan- 
cies in WTCCC and external identifying information, non- 
European ancestry, duplication and apparent relatedness. 
SNPs were excluded on the basis of low call rate, extreme 
deviation from Hardy-Weinberg equilibrium (HWE), dif- 
ferential allele or genotype frequencies between the two 
control cohorts and manual visual inspection of genotype 
calls in cluster plots. 

To allow for fine-scale population structure, which may 
have greater impact on rare variant association signals than 
common SNPs because of recent founder effects [Bodmer 
and Bonilla, 2008], we constructed principal components to 
represent axes of genetic variation within the UK. We ap- 
plied EIGENSTRAT [Price et al., 2006] to a subset of high- 
quality LD-pruned SNPs (r 2 < 0.2) with MAF of at least 5%, 
and projected samples onto principal components demon- 
strating clear separation between 12 UK regions of residence 
[The Wellcome Trust Case Control Consortium, 2007]. 

We imputed the high-quality samples up to the Phase I 
1000 Genomes Project reference panel (June 2011 interim 
release) consisting of 1,094 phased individuals from mul- 
tiple ancestry groups [The 1000 Genomes Project Consor- 
tium, 2010]. We removed SNPs with MAF < 1% from 
the GWAS scaffold, prior to imputation, since we expect 
these variants are likely to be subject to higher geno- 
typing errors, which may impact the downstream analy- 
sis. We performed imputation using IMPUTEv2 [Howie 
et al., 2009] with default parameter settings and sample 
pre-phasing, allowing a buffer region of 500 kb. Subse- 
quently, we tested for association of each disease with 'high- 
quality' rare variants (MAF less than 1%, and IMPUTEv2 
info score greater than 0.4) within genes using GRANVIL, 
adjusting for principal components as covariates to account 
for fine-scale UK population structure. Gene boundaries 
were defined from the UCSC human genome database 
(build 37). 



RESULTS 

SIMULATION STUDY 

Figure 1 presents the power, at a nominal significance 
level of P < 0.05, to detect association with a quantitative 
trait, for each of the design strategies for assaying rare ge- 
netic variation in the gene. For these results, we assumed 
that multiple rare causal variants in the gene jointly con- 



tribute to 5% of the overall trait variation. The panels cor- 
respond to two specific trait association models: (A) the 
maximum MAF of any individual causal variant is 1%, 
and the total MAF of all causal variants is 5%; and (B) 
the maximum MAF of any individual causal variant is 
0.5%, and the total MAF of all causal variants is 2%. Un- 
der the second of these models, we expect fewer rare vari- 
ants within the gene to be causal, since the total MAF is 
lower. A higher proportion of non-causal variants within 
the gene would be expected to reduce power overall, ir- 
respective of the design strategy and /or the number of 
individuals in the reference panel [Morris and Zeggini, 
2010]. 

Our results highlight a number of general conclusions 
across these trait association models. As expected, the most 
powerful strategy to detect rare variant association is to re- 
sequence the analysis cohort. In the absence of sequencing 
errors, this 'gold-standard' strategy provides complete cov- 
erage of rare genetic variation in the gene within the analysis 
cohort. Nevertheless, genotyping the analysis cohort for all 
rare variants present in the reference panel generally results 
in a relatively small reduction of power, particularly for R 
— 4,000. We expect most of the rare variation in the analysis 
cohort to be captured by such large reference panels (Sup- 
porting Information Figure SI). Rare variants not captured 
by the reference panel (e.g., private mutations) are less likely 
to have a major impact on the joint contribution of causal 
variation in the gene under our simulation model, and thus 
would not be expected to lead to a dramatic reduction in 
power. 

As previously reported [Morris and Zeggini, 2010], geno- 
typing of the analysis cohort with the GWAS chip alone 
has minimal power to detect association because very few 
rare variants within the gene are assayed directly (Support- 
ing Information Figure SI). However, imputation into this 
GWAS scaffold in the analysis cohort up to the density of 
the reference panel can lead to substantial gains in power to 
detect rare variant association within the gene. The extent 
of the increase in power depends crucially on the number of 
individuals in the reference panel, although the gains from R 
= 500 toR = 4,000 are not as great as from R = 120 to R = 500, 
particularly for an association model incorporating causal 
variants with MAF up to 1% (Figure 1). Reference panels 
with more individuals provide more comprehensive cover- 
age of rare variation in the region (Supporting Information 
Figure SI), higher quality imputation, and thus greater im- 
provements in power. Note that the relative power of im- 
putation appears lower under trait association model (A), 
where the maximum MAF of any causal variant is lower 
than under model (B). This is not unexpected since the dis- 
tribution of causal allele frequencies will be more skewed 
to the rarest variants under this model, which we anticipate 
to be most difficult to impute, irrespective of the size of the 
reference panel. 

We also considered the impact of sequencing errors 
and missing and /or miscalled genotypes on the power 
of the four alternative strategies. As expected, the power 
of all strategies is decreased as the error rates and 
the frequency of missing genotypes increase (Support- 
ing Information Figure S2). However, we are still able 
to recover much of the power of the gold-standard re- 
sequencing strategy through imputation of the analysis 
cohort from the GWAS scaffold, and we maintain con- 
siderable advantages over genotyping of the GWAS chip 
alone. 
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Fig. 1. Power, at a nominal significance level of P < 0.05, to detect association of an accumulation of minor alleles with a quantitative trait, 
for different strategies for assaying rare genetic variation in a 50 kb gene, as a function of the size of the reference panel. Multiple causal 
variants in the gene contribute jointly to 5% of the overall trait variation. The panels correspond to two specific trait association models: 
(A) the maximum MAF of any individual causal variant is 1%, and the total MAF of all causal variants is 5%; and (B) the maximum MAF 
of any individual causal variant is 0.5%, and the total MAF of all causal variants is 2%. 



APPLICATION TO IMPUTED RARE VARIANT 
GWAS OF SEVEN COMMON COMPLEX 
DISEASES 

A total of 13,241 cases and 2,938 controls from the WTCCC 
experiment passed sample QC filters (Supporting Infor- 
mation Table SI). Of the autosomal variants on the array, 
456,868 passed SNP QC filters. We then applied EIGEN- 
STRAT to an LD-pruned (r 2 > 0.2) set of 27,770 high-quality 
autosomal SNPs with MAF > 5% to construct 10 axes of 
genetic variation of UK population structure. By projecting 
samples onto the corresponding principal components, we 
observed that the first three axes of genetic variation were 
strongly associated with the region of residence of sam- 
ples (Figure 2). The first principal component separated 
London and Scotland from the remainder of the United 
Kingdom, whilst the second and third principal compo- 
nents separated regions within the United Kingdom on a 



North-West to South-East axis. These three principal com- 
ponents were thus selected for adjustment of downstream 
association analyses to allow for fine-scale UK population 
structure. 

After removal of variants with MAF < 1%, a total of 
391,060 high-quality SNPs remained in the GWAS scaffold. 
A total of 8,239,134 rare variants were successfully imputed 
up to the Phase I 1000 Genomes Project reference panel 
(June 2011 interim release) [The 1000 Genomes Project Con- 
sortium, 2010] and were polymorphic in the WTCCC exper- 
iment. Of these, 5,383,228 (65.3%) had IMPUTEv2 info score 
of at least 0.4. Amongst these 'well-imputed' rare variants, 
the mean info score was 0.618, and 17.3% had info score 
greater than 0.8. 

Figure 3 presents Manhattan plots to summarise the asso- 
ciation of each disease with accumulations of minor alleles 
at well-imputed rare variants within genes, after correction 
for the three axes of genetic variation as covariates in the 
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Fig. 2. Principal components representing axes of genetic variation demonstrating clear separation between 12 UK regions of residence. 
Each point represents the mean projection of samples from each UK region onto the first three axes of genetic variation. 



logistic regression model. In these Manhattan plots, each 
point represents a gene (as defined by the UCSC human 
genome database), and those achieving genome-wide sig- 
nificance (Bonferroni correction for 30,000 genes, P < 1.7 x 
10~ 6 ) are highlighted in red. There was no evidence of resid- 
ual population structure, not accounted for by the three axes 
of genetic variation, with genomic control inflation factors 
[Devlin and Roeder, 1999] less than one for all seven diseases 
(Supporting Information Figure S3). 

We observed genome-wide significant evidence of asso- 
ciation of coronary artery disease with rare variants in the 
gene PRDM10 (P = 4.9 x 10" 8 ). The gene contained 122 well- 
imputed rare variants with mean MAF of 0.23%. Accumula- 
tions of minor alleles across these variants were associated 
with decreased risk of disease: odds ratio 0.828 (0.774-0.886) 
per minor allele. We also observed 10 genes with genome- 
wide significant evidence of rare variant association with 
type 1 diabetes, all located within the major histocompati- 
bility complex (MHC) (Table I and Figure 4). The strongest 
signal of association was observed for HLA-DRA (P — 2.0 x 
10~ 13 ), which has been previously implicated in susceptibil- 
ity to type 1 diabetes [Nejentsev et al, 2007]. Accumulations 
of minor alleles at rare variants in nine of the MHC genes 
were associated with reduced risk of type 1 diabetes (Table 
I). The only gene demonstrating evidence of association of 
accumulations of minor alleles with increased risk of type 1 
diabetes was TNXA, with odds ratio 2.346 (1.772-3.107) per 
minor allele. 

Common SNPs in the MHC have been previously associ- 
ated with the disease [Barrett et al., 2009; The Wellcome Trust 
Case Control Consortium, 2007], although fine-mapping of 



the underlying causal variant(s) has been hindered by the 
extensive LD across the region. We thus repeated our anal- 
yses in this region, testing for association of type 1 diabetes 
with rare variants within MHC genes after adjustment for 
the lead GWAS SNP (rs9268645) [Barrett et al., 2009], with 
genotypes coded by the number of minor alleles, included 
as an additional covariate in the logistic regression model 
(Table I and Figure 4). The common SNP could not fully ex- 
plain rare variant associations of type 1 diabetes with any of 
the MHC genes, but dramatically reduced significance with 
TNXA (P = 2.6 x 10- 9 before adjustment; P = 2.4 x 10" 4 
after adjustment). After adjustment, three additional MHC 
genes achieved genome-wide significant evidence of rare 
variant association with type 1 diabetes: HLA-DMA (P — 
1.1 x 10~ 7 ), SKIV2L (P = 2.6 x 10~ 7 ) and TNXB (P = 4.1 x 

io- 7 ). 

DISCUSSION 

GWAS has been extremely successful in identifying ge- 
netic loci contributing effects to a wide range of complex 
human traits [Hindorff et al, 2009] including diseases such 
as type 2 diabetes [Voight et al., 2010] and Crohn's disease 
[Franke et al, 2010], and quantitative phenotypes such as 
body mass index [Speliotes et al, 2010] and height [Lango 
Allen et al., 2010]. However, despite the success of this ap- 
proach, much of the genetic component of these traits re- 
mains, as yet, unexplained [Manolio et al, 2009]. Most of 
the confirmed associations within these loci are with com- 
mon variants, of modest effect, which well-designed GWAS 
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Fig. 3. Manhattan plots summarising association of seven diseases from the WTCCC experiment with accumulations of well-imputed 
rare variants (MAF < 1% and info score of at least 0.4) within genes (as defined by the UCSC human genome database). Each point 
represents a gene, plotted according to the observed -logirj P-value of association (y-axis) and the physical position of the midpoint of 
the transcript (x-axis), with those achieving genome-wide significance (P < 1.7 x 10 -6 ) highlighted in red. 



is adequately powered to detect. It has thus been suggested 
that much of the 'missing heritability' of complex human 
traits can be attributed to rare genetic variation [Bansal et al., 
2010], which is not well captured by GWAS genotyping 
products. The most comprehensive approach to assaying 



rare genetic variation is through large-scale next-generation 
re-sequencing experiments. However, despite advances in 
the cost-effectiveness of these technologies, whole-genome 
re-sequencing of the large cohorts of individuals re- 
quired to detect rare variant association with complex traits, 
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genome-wide, still represents an infeasible financial invest- 
ment for most research groups. 

We demonstrate here, by simulation, that imputation 
from an existing scaffold of G WAS genotype data using pub- 
licly available high-density reference panels, such as those 
made available through the 1000 Genomes Project [The 1000 
Genomes Project Consortium, 2010], has the potential to 
identify rare variant associations with complex traits, with- 
out the need for costly re-sequencing experiments. These re- 
sults are entirely consistent with other published simulation 
studies investigating the performance of rare variant asso- 
ciation methodology using imputation up to re-sequencing 
data, either from an external reference panel [Li et al., 2010], 
or from a subset of the analysis cohort [Zawistowski et al., 
2010]. Overall, our results suggest that a reference panel of 
4,000 individuals offers noticeable gains in power over 500 
individuals only when the spectrum of causal variants is 
loaded with rarer variants (i.e. in our simulations, when the 
maximum MAF of any individual causal variant is 0.5%, 
rather than 1%). In this scenario, imputation of the anal- 
ysis cohort from the GWAS scaffold can achieve much of 
the power to detect rare variant association obtained by the 
gold-standard re-sequencing strategy. Our simulations as- 
sumed a GWAS scaffold with the same characteristics, in 
terms of allele frequency profile and density, as the Illu- 
mina Human660W-Quad BeadArray. We would expect the 
quality of imputed rare variants to be improved with more 
dense GWAS scaffolds, such as the Illumina HumanOmni5- 
Quad, although this evaluation is beyond the scope of this 
study. 

We have considered a relatively simple underlying model 
for the association of the trait with multiple rare variants 
within the gene. More complex models might incorporate 
selection and / or different directions of effect of the causal 
variants on the trait. However, at present, we do not fully 
appreciate the likely effect of rare variants within a gene 
or pathway on complex human traits, although it is clear 
that the true underlying association model will impact the 
power of GRANVIL. Nevertheless, it is less obvious that 
the underlying association model will impact the relative 
performance of GRANVIL applied to rare variation derived 
from imputation as compared to that assayed through re- 
sequencing. 

Our simulation study assumes that the analysis co- 
hort and reference panel are ascertained from the same 
population, and thus are perfectly matched in terms of their 
rare variant profile. However, with publicly available refer- 
ence panels, this is unlikely to be the case. Indeed, the Phase 
1 1000 Genomes Project reference panel (June 2011 interim 
release) consists of phased individuals from multiple pop- 
ulations that together incorporate a wide range of ancestry 
groups [The 1000 Genomes Project Consortium, 2010]. One 
cost-efficient approach to address this issue is to consider 
re-sequencing a small number of individuals from the anal- 
ysis cohort to supplement the reference panel. This strategy 
has been successfully applied in identifying association of 
a population-specific imputed rare variant with sick sinus 
syndrome in Iceland [Holm et al., 2011]. 

Our simulation study also assumes that all rare causal 
variants in the gene have the same impact on the trait, for 
example, that they all result in loss-of-function of the gene 
product. The GRANVIL software makes the same under- 
lying assumption, and hence power to detect rare variant 
association will be maximised. Under alternative models of 
rare variant association with the trait, which consider causal 
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Fig. 4. Regional plots summarising association of type 1 diabetes with accumulations of well-imputed rare variants (MAF < 1% and 
info score of at least 0.4) within MHC genes (as defined by the UCSC human genome database). Each point represents a gene, and those 
achieving genome-wide significance (P < 1.7 x 10 -6 ) are highlighted in red. The panels correspond to analyses (A) before and (B) after 
adjustment for the lead common GWAS SNP (rs9268645) in the region. 



variants within the gene to result in both loss- and gain- 
of-function of the gene product, the power of GRANVIL 
will be reduced for all design strategies. For these models, 
powerful methods exist for detecting association with rare 
genetic variation [Asimit et al., 2012; Neale et al., 2011; Wu 
et al., 2011]. Many of these methods, such as C-alpha [Neale 
et al., 2011], require direct genotype calls. These approaches 
could make use of 'best guess' genotypes from imputation, 
although further development is required to appropriately 
allow for the posterior distribution of calls for imputed 
variants. 

The encouraging results of our simulation study 
prompted us to re-assess the evidence of rare variant as- 
sociation with seven diseases from the WTCCC experi- 
ment [The Wellcome Trust Case Control Consortium, 2007]. 
We were able to recover genotypes at more than 5 mil- 
lion 'high-quality' imputed rare variants, even with the 
Affymetrix GeneChip 500K Mapping Array Set as a scaf- 
fold, which would not be expected to capture variation as 
well as more recent higher density genotyping products. 
Principal components analysis identified three axes of ge- 
netic variation that capture fine-scale population structure 



within the United Kingdom. After adjustment for these 
three components as covariates in our logistic regression 
modelling, there was no discernable residual inflation in 
rare variant association statistics, indicating that any ad- 
ditional fine-scale population structure had no impact on 
our analysis. We identified association of coronary artery 
disease with accumulations of minor alleles at rare vari- 
ants in PRDM10 at genome-wide significance. This gene 
has not been previously implicated in susceptibility to coro- 
nary artery disease or related cardio-metabolic phenotypes, 
and this association signals warrant follow-up in indepen- 
dent cohorts, either through genotyping of rare variants in 
the gene or re-sequencing. We also identified genome-wide 
significant evidence of association of type 1 diabetes with 
accumulations of minor alleles at rare variants in multiple 
genes from the MHC. This region has been previously asso- 
ciated with type 1 diabetes, both through common variant 
GWAS and analysis of classical human leukocyte antigen 
(HLA) haplotypes. Further work is required to dissect the 
complex genetic contribution of common and rare variation 
in this region to susceptibility to type 1 diabetes and other 
autoimmune disorders. 
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The results of our analyses presented here have major im- 
plications for the design and analysis of genome-wide rare 
variant association studies of complex human traits. Our 
results clearly highlight the potential for the detection of 
rare variant associations by using existing GWAS genotype 
data, supplemented with imputation from publicly avail- 
able high-density reference panels, without the need for 
costly whole-genome re-sequencing experiments. Although 
imputation can never replace the gold-standard approach of 
whole-genome re-sequencing, it provides a powerful, cost- 
effective alternative that only requires a scaffold of GWAS 
genotype data, which may already be available. This mes- 
sage will bring encouragement to research groups who do 
not have sufficient funding to consider whole-genome, or 
even whole-exome sequencing as a financially viable ap- 
proach to assaying rare genetic variation. It is clear that 
GWAS still have the potential to offer an exciting opportu- 
nity for gene discovery, conceivably leading to substantial 
advancements in our understanding of the genetic architec- 
ture underlying complex human traits. 
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APPENDIX: CALCULATION OF THE 
RESIDUAL TRAIT VARIANCE 

Determine the number of minor alleles carried by the 
ith individual in the analysis cohort across all causal 
variants in the 50 kb gene, denoted m,. The expected trait 
value for the rth individual, denoted by [jl,, takes the value 
1 if nti > 0, and 0 otherwise. Assuming an analysis co- 
hort of N individuals, the mean expected trait value is 
given by 



^ N 



with corresponding genetic variance given by 



N 



Assuming that the rare causal variants explain 100\% of 
the overall trait variance, it follows that the residual variance 
is given by 



a 2 = Vr 



(1-X) 
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